Ask A Manager Survey
INTRODUCTION
In early 2021, the Ask A Manager blog ran its annual salary survey. The survey responses are stored on Google Sheets, making the data accessible to anyone interested. In a previous post, we discussed data pre-processing and the feature selection method. This post focuses on two aspects, namely: 1) reproducibility and 2) out-of-sample testing.
REPRODUCIBILITY
In the previous post, we detailed feature selection by regularised regression, specifically LASSO regression. In this post, we attempt to reproduce those feature selection results.
OUT OF SAMPLE TESTING
In this post, we also follow the traditional machine learning workflow: splitting the data into training and testing samples, fitting a model, cross-validating it, and finally fitting the model on the out-of-sample (testing) dataset. This approach can inform us about the model's parsimony; in other words, can the model perform well on an unseen sample?
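The splitting code itself is not shown in this post; the snippet below is a minimal sketch of that step. The dataframe name Salary_Data, the 80/20 proportion and the seed are illustrative assumptions.

library(tidymodels)

# Illustrative seed so the split is reproducible.
set.seed(2021)
# Assumed 80/20 split of the cleaned survey data.
salary_split <- initial_split(Salary_Data, prop = 0.8)
Salary_Train <- training(salary_split)
Salary_Test <- testing(salary_split)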
LASSO REGRESSION
As mentioned above, the data cleaning process was detailed in an earlier post. It is worth highlighting that the dataset contains several dummy variables indicating various respondent attributes such as job title, industry, city, age group, etc. Our dependent variable is the annual salary. Given the large number of independent variables, LASSO regression is a natural choice, since it shrinks the coefficients of unimportant predictors to exactly zero. In the code below, we fit a LASSO regression model.
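The original fitting code is not reproduced here, so what follows is a minimal sketch of this step with glmnet, assuming the dummy-coded predictors and the outcome column new_annual_salary live in Salary_Train.

library(glmnet)

# Build the predictor matrix and the outcome vector from the training data.
x <- model.matrix(new_annual_salary ~ ., data = Salary_Train)[, -1]
y <- Salary_Train$new_annual_salary

# alpha = 1 selects the LASSO penalty; glmnet fits the whole lambda path.
lasso <- glmnet(x, y, alpha = 1)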
IMPORTANT VARIABLE EXTRACTION
The resulting object is a list of class “glmnet”. Essentially, it contains the coefficients at every step along the lambda path, down to the minimum lambda. As such, there are several sets of results in the object. We are primarily interested in extracting the variables that remain at the minimum lambda, along with their coefficients. The vip package can automate the plotting of these results. However, we want to extract the variable names in order to fit them on our training set. There is probably a package to assist with this step somewhere in the wild, but we haven’t found it yet. Luckily, in R we can write custom functions. The code below details the variable_extractor function.
variable_extractor <- function(a_list){
  # Flag the position of the smallest lambda in the fitted lambda path.
  min_lambda <- data.frame(best_lambda = a_list[["lambda"]] == min(a_list[["lambda"]]))
  min_lambda <- cbind.data.frame(data.frame(index = rownames(min_lambda)),
                                 min_lambda)
  min_lambda <- min_lambda %>%
    filter(best_lambda == TRUE)
  min_lambda <- as.double(unique(min_lambda$index))
  # Extract the full coefficient matrix: one column per lambda value.
  lasso_variables <- as.matrix(coef(a_list)) %>% data.frame()
  lasso_variables <- cbind.data.frame(variables = rownames(lasso_variables),
                                      lasso_variables)
  lasso_variables <- as_tibble(lasso_variables)
  # Keep the variable names plus the coefficients at the smallest lambda.
  lasso_variables <- lasso_variables[, c(1, min_lambda + 1)]
  names(lasso_variables) <- c("variable", "importance")
  # Discard the intercept and any coefficient shrunk to exactly zero.
  lasso_variables <- lasso_variables %>%
    filter(variable != "(Intercept)",
           importance != 0) %>%
    arrange(desc(importance))
  return(lasso_variables)
}

lasso_variable <- variable_extractor(lasso)
The function takes an object of class “glmnet”. First, we find the smallest lambda in the lambda path. Next, we extract all the coefficients from the object and store them as a wide dataframe. The dataframe is then subset to contain only the variable names along with the coefficients at the smallest lambda value. We discard the intercept and any variables whose coefficients equal zero. The final output is identical to the variables extracted by the vip package. Finally, we subset both the training dataset and the testing dataset to contain only the selected independent variables. Since the outcome variable is expressed in USD terms, we log the outcome variable.
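A minimal sketch of these two final steps, assuming the object and column names used throughout this post:

# Variable names retained by the LASSO at the smallest lambda.
selected_vars <- lasso_variable$variable

# Keep only the outcome and the selected predictors in both splits.
Salary_Train <- Salary_Train[, c("new_annual_salary", selected_vars)]
Salary_Test <- Salary_Test[, c("new_annual_salary", selected_vars)]

# Salaries are in USD, so model the outcome on the log scale.
Salary_Train$new_annual_salary <- log(Salary_Train$new_annual_salary)
Salary_Test$new_annual_salary <- log(Salary_Test$new_annual_salary)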
TIDYMODELS: A CLEAN INTERFACE FOR MACHINE LEARNING
The R programming language does not lack methods for running machine learning algorithms; it is, after all, a statistical programming language. The tidymodels metapackage aims to provide a standard interface for modelling and machine learning using tidyverse principles. Below, we use the package to complete a number of steps: cross-fold validation, model specification, fitting on the resamples and, finally, fitting the model on the training dataset.
# 10-fold cross-validation on the training data.
Salary_Folds <- vfold_cv(Salary_Train)
# Plain linear regression via the lm engine.
lm_spec <- linear_reg(engine = "lm")
# Drop near-zero-variance predictors before fitting.
lm_recipe <- recipe(new_annual_salary ~ ., Salary_Train) %>%
  step_nzv(all_predictors())
lm_wf <- workflow(lm_recipe, lm_spec)
# Parallelise the resampling and keep the out-of-fold predictions.
doParallel::registerDoParallel(cores = 10)
ctrl_preds <- control_resamples(save_pred = TRUE)
cv_results <- fit_resamples(lm_wf, Salary_Folds, control = ctrl_preds)
# Final fit on the full training dataset.
lm_wf <- fit(lm_wf, Salary_Train)

# Average the R-squared estimate across the 10 folds.
collect_metrics(cv_results, summarize = FALSE) %>%
  filter(.metric == "rsq") %>%
  summarise(avg_estimate = mean(.estimate),
            .groups = "drop")
# A tibble: 1 × 1
avg_estimate
<dbl>
1 0.268
Across all 10 validation folds, the linear regression model’s R-squared averaged 0.268. It is possible to tune the parameters and use a battery of other machine learning models to improve performance. In the previous post, we utilised a random forest to try and improve the model; despite requiring additional computational power, the increase in performance was marginal to negligible. A sketch of such a comparison follows.
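The sketch below shows how that comparison looks in tidymodels, reusing the recipe, folds and control object defined above; the ranger engine and the number of trees are illustrative assumptions, not the settings from the previous post.

# Random forest specification via the ranger engine (illustrative settings).
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Resample the random forest on the same folds for a like-for-like comparison.
rf_wf <- workflow(lm_recipe, rf_spec)
rf_results <- fit_resamples(rf_wf, Salary_Folds, control = ctrl_preds)
collect_metrics(rf_results)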
At this point, we have completed our first objective: we are able to reproduce the LASSO regression results from the previous post. The next objective is to determine the parsimony of the model. Here, we fit the model on the test data.
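A sketch of the out-of-sample refit is below, reusing the recipe and specification from above; extract_fit_engine() pulls out the underlying lm object so that broom::glance() can report the adjusted R-square shown in the test table.

# Refit the same recipe and model specification on the test split.
lm_test_fit <- fit(workflow(lm_recipe, lm_spec), Salary_Test)

# Report fit statistics, including the adjusted R-square.
lm_test_fit %>%
  extract_fit_engine() %>%
  broom::glance()

The published tables themselves appear to be gtsummary-style regression summaries, which can be built from the same extracted lm objects.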
RESULTS
Training Dataset
Linear Regression Model: Training Dataset
Dependent Variable: log(Annual Salary (USD))

Characteristic | Beta (SE)¹,²
---|---
(Intercept) | 11***
years_of_experience_in_field_X21...30.years | 0.43***
job_title_director | 0.32***
years_of_experience_in_field_X11...20.years | 0.30***
industry_government | 0.30***
highest_level_of_education_completed_PhD | 0.26***
city_york | 0.25***
years_of_experience_in_field_X8...10.years | 0.24***
job_title_engineer | 0.23***
industry_tech | 0.27
job_title_senior | 0.15***
years_of_experience_in_field_X5.7.years | 0.15***
industry_health | 0.15
industry_finance | 0.28
how_old_are_you_X35.44 | 0.12***
how_old_are_you_X45.54 | 0.11***
job_title_manager | 0.11***
how_old_are_you_X25.34 | 0.08***
industry_computing | 0.03
highest_level_of_education_completed_Master.s.degree | 0.04***
industry_engineering | 0.04
industry_manufacturing | 0.04
job_title_analyst | 0.04*
years_of_experience_in_field_X2...4.years | 0.04*
gender_Woman | -0.03***
industry_administration | -0.04
industry_care | -0.08
industry_banking | -0.23
race_White | -0.12***
overall_years_of_professional_experience_X11...20.years | -0.14***
overall_years_of_professional_experience_X21...30.years | -0.14***
overall_years_of_professional_experience_X8...10.years | -0.15***
overall_years_of_professional_experience_X5.7.years | -0.18***
industry_education | -0.19***
industry_nonprofits | -0.19***
highest_level_of_education_completed_Some.college | -0.20***
overall_years_of_professional_experience_X2...4.years | -0.22***
industry_public | -0.22***
job_title_assistant | -0.24***
Adjusted R-square: 0.2732

¹ *p<0.05; **p<0.01; ***p<0.001
² SE = Standard Error
Test Dataset
Linear Regression Model: Test Dataset
Dependent Variable: log(Annual Salary (USD))

Characteristic | Beta (SE)¹,²
---|---
(Intercept) | 11***
years_of_experience_in_field_X21...30.years | 0.30***
job_title_director | 0.33***
years_of_experience_in_field_X11...20.years | 0.27***
industry_government | 0.19
highest_level_of_education_completed_PhD | 0.27***
city_york | 0.27***
years_of_experience_in_field_X8...10.years | 0.21***
job_title_engineer | 0.27***
industry_tech | -0.23
job_title_senior | 0.17***
years_of_experience_in_field_X5.7.years | 0.10**
industry_health | 0.15
industry_finance | 0.16
how_old_are_you_X35.44 | 0.07*
how_old_are_you_X45.54 | 0.10**
job_title_manager | 0.14***
how_old_are_you_X25.34 | 0.03
industry_computing | 0.47
highest_level_of_education_completed_Master.s.degree | 0.03*
industry_engineering | 0.26*
industry_manufacturing | -0.18
job_title_analyst | 0.08**
years_of_experience_in_field_X2...4.years | 0.00
gender_Woman | -0.03
industry_administration | 0.14
industry_care | -0.07
industry_banking | -0.08
race_White | -0.10***
overall_years_of_professional_experience_X11...20.years | -0.05
overall_years_of_professional_experience_X21...30.years | -0.05
overall_years_of_professional_experience_X8...10.years | -0.05
overall_years_of_professional_experience_X5.7.years | -0.10*
industry_education | -0.20***
industry_nonprofits | -0.19***
highest_level_of_education_completed_Some.college | -0.26***
overall_years_of_professional_experience_X2...4.years | -0.12**
industry_public | -0.30***
job_title_assistant | -0.17***
Adjusted R-square: 0.2624

¹ *p<0.05; **p<0.01; ***p<0.001
² SE = Standard Error
CONCLUSION
The model performs comparably on the out-of-sample dataset (adjusted R-square of 0.2624 against 0.2732 on the training data), with years of experience in field (21 - 30 years), job title (Director) and highest level of education (PhD) among the most important predictors of annual salary. Provided the questions used in the 2022 survey are similar or identical to those in the 2021 survey, the model above can be evaluated on the 2022 survey results.
REFERENCES
Silge, J. (2021). Fit and predict with tidymodels for #TidyTuesday bird baths in Australia. Available from: https://juliasilge.com/blog/bird-baths/ (accessed 24 December 2021).
Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Firke, S. (2021). janitor: Simple Tools for Examining and Cleaning Dirty Data. R package version 2.1.0. https://CRAN.R-project.org/package=janitor
Kuhn, M., et al. (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. https://www.jstatsoft.org/v33/i01/
Iannone, R., Cheng, J., & Schloerke, B. (2021). gt: Easily Create Presentation-Ready Display Tables. R package version 0.3.1. https://CRAN.R-project.org/package=gt
Mock, T. (2021). gtExtras: A Collection of Helper Functions for the gt Package. R package version 0.2.2.11. https://github.com/jthomasmock/gtExtras
Sjoberg, D.D., Whiting, K., Curry, M., Lavery, J.A., & Larmarange, J. (2021). Reproducible summary tables with the gtsummary package. The R Journal, 13, 570-580. https://doi.org/10.32614/RJ-2021-053